
chore: rename deprecated orchestrator config keys #2327

Merged
mikasenghaas merged 1 commit into main from chore/rename-deprecated-config-keys on Apr 19, 2026

Conversation

mikasenghaas (Member) commented on Apr 19, 2026

Summary

  • Rename [orchestrator.sampling] → [orchestrator.train.sampling] and [[orchestrator.env]] → [[orchestrator.train.env]] across all configs
  • Rename max_tokens → max_completion_tokens in sampling sections
  • Drop reliance on the deprecated auto-translation emitted by the orchestrator config validator
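
A before/after sketch of the rename in TOML (section contents are illustrative, not copied from a specific config):

    # Before (deprecated keys, auto-translated by the config validator)
    [orchestrator.sampling]
    max_tokens = 4096

    [[orchestrator.env]]
    id = "example-env"

    # After
    [orchestrator.train.sampling]
    max_completion_tokens = 4096

    [[orchestrator.train.env]]
    id = "example-env"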

Validation

  • Ran uv run rl @ <config> --dry-run on all 38 modified RL configs — no deprecation warnings
  • Validated the two orchestrator-only partial configs (configs/debug/orch.toml, configs/ci/integration/rl_multi_run/orchestrator.toml) via direct OrchestratorConfig.model_validate — no deprecation warnings

🤖 Generated with Claude Code


Note

Low risk: this is a mechanical rename of TOML config keys/fields to match the current orchestrator schema, with no functional code changes. The main risk is mis-typed keys causing configs to be ignored or validation to fail at runtime.

Overview
Updates training configs across configs/ and examples/ to stop using deprecated orchestrator keys.

Specifically renames [orchestrator.sampling] to [orchestrator.train.sampling], [[orchestrator.env]] to [[orchestrator.train.env]] (and similarly for orchestrator-only partial configs), and replaces max_tokens with max_completion_tokens in sampling sections.

Reviewed by Cursor Bugbot for commit 6ce2707.

Rename '[orchestrator.sampling]' -> '[orchestrator.train.sampling]',
'[[orchestrator.env]]' -> '[[orchestrator.train.env]]', and
'max_tokens' -> 'max_completion_tokens' across all configs to remove
reliance on the deprecated auto-translation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mikasenghaas mikasenghaas requested a review from samsja April 19, 2026 16:10
@mikasenghaas mikasenghaas marked this pull request as ready for review April 19, 2026 16:10
@mikasenghaas mikasenghaas merged commit d2718f5 into main Apr 19, 2026
17 of 18 checks passed
joanvelja added a commit to joanvelja/prime-rl that referenced this pull request Apr 20, 2026
…nt (#1)

* feat(orchestrator): multi-actor debate env integration + tests

Wire prime-rl's orchestrator to the multi-actor debate environment
(forks/verifiers/verifiers/envs/debate*). Adds the orchestrator-side
glue and the unit-test suite for the debate env's W/G/M scoring path.

Source modules (src/prime_rl/orchestrator/):
- multi_actor.py: orchestrator dispatch for multi-actor episodes
- multi_actor_advantage.py: GRPO/RAE advantage computation across
  per-actor rewards, handles role-conditioned advantage attribution
- multi_actor_bridge.py: trajectory ↔ training-batch bridge with
  two-table output (one row per actor step), no flattening
- multi_actor_eval.py: eval-mode scaffolding for multi-actor rollouts
- eval_utils.py: small adjustments to thread multi-actor state through
  the eval loop
- vf_utils.py: small adjustments to surface the new env factory params
  (judge_client, judge_model, judge_max_retries, etc.) to verifiers
  load_environment
- .gitignore: ignore .DS_Store noise

Tests (tests/unit/orchestrator/):
- test_debate_env.py: 216-test coverage of DebateEnv rollout, W/G/M
  scoring, F2 short-circuit, state['error'] capture via maybe_retry,
  composed JudgeRubric grader+matcher, latest-step authority, MCQ
  fast path, judge wrap_opponent viewer_role threading, verdict
  collision validation, metrics/error_info split
- test_debate_fields.py: field extraction + scoring mode coverage
- test_debate_prompts.py: prompt rendering + opponent_wrap viewer_role
  + judge template loading
- test_multi_actor.py / test_multi_actor_bridge.py /
  test_multi_actor_e2e.py / test_multi_actor_eval.py: foundation
  multi-actor protocol coverage

Critical regression guard:
test_debate_env.test_score_rollout_captures_vf_error_from_grader —
verifies vf.InvalidModelResponseError from a composed grader_rubric
flows through _grade → _score_rollout_body → score_rollout's
except vf.Error → state['error'] (for maybe_retry retry discovery)
+ state['metrics']['errored_rollout']=1.0 + state['error_info']
{error_type, error_phase}. Single backend call (no implicit retry at
score_rollout level; retry layered correctly at run_group_attempt).

Suite: 216 orchestrator tests + 3 fork-internal JudgeRubric tests
= 219 passing, 0 failing.

* fix 2.5 -> qwen (#2286)

* fix 32 -> 30 (#2287)

* feat: set tool_call_parser default to 'auto' (#2285)

* feat: set tool_call_parser default to 'auto'

Changed the default value of tool_call_parser from None to 'auto' to enable automatic tool-call parser detection from the model name. This provides a better out-of-the-box experience for users working with tool-calling models.

* test: add unit tests for inference metrics collector

Tests parsing, aggregation (sum/max/mean), counter rates, histogram
latency, counter reset handling, server failures, and wandb logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "test: add unit tests for inference metrics collector"

This reverts commit 48eb049144b51ec9a2562358b398c2be46bc8eca.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: pre-download model weights in launcher (#2282)

* feat: pre-download model weights in launcher instead of using HF_HUB_OFFLINE

Remove hardcoded `HF_HUB_OFFLINE=1` from multi-node SLURM templates and
instead pre-download model weights via `snapshot_download` in the rl/sft
launchers before dispatching to local or SLURM execution. This ensures
weights are cached on the shared filesystem before training starts,
removing the need to manually pre-download models.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: pre-download model weights in launcher instead of using HF_HUB_OFFLINE

Remove hardcoded `HF_HUB_OFFLINE=1` from multi-node SLURM templates and
instead pre-download model weights via `snapshot_download` in the rl/sft
launchers before dispatching to local or SLURM execution. This ensures
weights are cached on the shared filesystem before training starts,
removing the need to manually pre-download models.

Also replace `format_time` with the verifiers-style two-unit display
(e.g. "1h 30m" instead of "1.50h").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
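
A minimal sketch of the pre-download step, assuming the helper simply wraps huggingface_hub's snapshot_download (launcher wiring omitted):

    from huggingface_hub import snapshot_download

    def pre_download_model(model_name: str) -> str:
        # Blocks until all weight files are in the local HF cache (the
        # shared filesystem under SLURM) and returns the snapshot path.
        return snapshot_download(repo_id=model_name)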

* refactor: move pre_download_model to trainer.model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: log cache path when model is already downloaded

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove redundant cache log from pre_download_model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move pre_download_model import to module top

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: skip download and log cache path when model already cached

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Revert "feat: skip download and log cache path when model already cached"

This reverts commit 5b2bac9f2bb9e22ae898fe54f66f48357931dc40.

* chore: keep HF_HUB_OFFLINE=1 in SLURM templates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: context parallelism for NemotronH Mamba layers (#2231)

* refactor(tests): relocate verifiers fork to sibling path, delete stub scaffolding

Move forks/verifiers/ to ../verifiers/ and switch pyproject to an editable
sibling install. Delete ~690 LOC of sys.path-injection + ModuleType stub
scaffolding from 7 orchestrator test files; tests now use normal Python
imports matching upstream verifiers conventions.

- pyproject.toml: verifiers source git-pin → editable path "../verifiers"
  with inline doc explaining sibling clone requirement
- _compat.py: try/except ImportError → importlib.util.find_spec guard
  (partial/broken transformers installs in training contexts still fail loud;
  cleanly-absent transformers in the fork venv takes the skip path)
- test_debate_env.py: FakeClient promoted to real vf.Client subclass;
  retry-loop tests use real maybe_retry + monkeypatched wait_none; dead
  _reraise_error_from_state helper + stale "module-level stub" comments
  deleted; _VFResponse/_VFUsage/_VFResponseMessage aliases dropped
- test_debate_prompts.py: _PROMPTS_DIR via importlib.resources (namespace-
  package & wheel-safe)
- Run-command docs added to test_debate_env.py docstring (cross-linked
  from fields/prompts docstrings)

Tests still require --noconftest because prime-rl's root conftest eagerly
imports prime_rl.trainer.world (torch/distributed). Orthogonal, out of scope.

Run (from fork venv):
  cd ../verifiers && uv run pytest \
    /path/to/prime-rl/tests/unit/orchestrator/test_*.py --noconftest

* Support runtime verifiers version override (#2274)

* Support runtime verifiers version override via VERIFIERS_VERSION env var

When set, the entrypoint reinstalls verifiers from the specified git ref
(tag, branch, or commit) before starting the main process.

* Drop --no-deps so transitive deps are updated with verifiers override

* Use --reinstall-package to only reinstall verifiers, not the entire dep tree

* fix: always ensure X-Session-ID and propagate extra_headers_from_state in elastic pool (#2283)

Two fixes:
- Use setdefault so X-Session-ID: example_id is always present for
  sticky DP-aware routing, even if user provides other
  extra_headers_from_state entries
- Propagate extra_headers_from_state when rebuilding clients in the
  elastic pool, so session headers survive pool refreshes

Keeps dp_rank_count as-is for direct DP rank routing.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
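
A minimal sketch of the setdefault fix, with assumed names for the surrounding variables:

    headers = dict(extra_headers_from_state or {})
    # setdefault keeps any user-provided entries while guaranteeing the
    # session header used for sticky DP-aware routing is always present.
    headers.setdefault("X-Session-ID", str(example_id))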

* Feat: fix cpu offloading patch to match upstream and remove a segfault (#2300)

* test(debate_env): explicit members required at construction

Exercises the new DebateEnv contract: empty/duplicate members raise,
and len(self.members) replaces _count_actors as the round-index divisor.

* fix(bridge): widen MemberRollout.example_id to int | str

EpisodeResult.base_example_id is typed int | str upstream, but the
bridge enforced int via _validated_example_id and TypedDict. Widen
MemberRollout.example_id to int | str and drop the int coercion
(keep the None check).

* test(kernel): assert KernelProtocolError is raised (and is a vf.Error)

Cover all three apply_action protocol-violation branches: wrong actor,
duplicate submission, and post-finished submission.

* fix(bridge): revert int|str widening — dataset and buffer still require int

Gatekeeper (HIGH): widening MemberRollout.example_id to int | str was a
local lie. verifiers.envs.environment._ensure_example_id coerces dataset
rows to int and prime_rl.orchestrator.buffer.Buffer keys its example
store by int. The first str id propagated through the bridge would blow
up non-locally at buffer-insert with a confusing stack trace.

Revert to int-only here and fail loud with a message pointing at the
two downstream layers. Full int | str propagation (dataset + buffer +
bridge together, with an integration test) is deferred to a follow-up.

* test(debate_env): cross-checks for members drift + cosmetic cleanup

Regression tests for the two cross-checks added in verifiers@d7ab4fb:
- test_debate_env_members_must_match_rubric_members (order-sensitive)
- test_debate_env_members_must_match_static_schedule_actors
- test_debate_env_skips_schedule_cross_check_for_dynamic_program

Also addresses auditor cosmetics:
- hoist KernelProtocolError / vf.Error imports to module top
- update stale docstring on test_kernel_rejects_wrong_actor

* test(orchestrator): migrate debate tests to channel-split Utterance

- Update Utterance fixtures to use raw_content/public_channel/private_channel.
- Replace strip_think/redact_think contract tests with parse_channels contract
  (hard-fail on unclosed, stray, multiple, nested).
- Replace unclosed-think privacy integration test with public_channel viewer
  check — leakage is now structurally impossible.
- Add apply_action malformed-markup rejection test.
- Fix attribution test schedule (add judge slot for members=[A,B,J]).
- Remove test_mcq_think_tag_stripped — think handling no longer lives in mcq.

* accept fully-qualified expert names in lora check (#2301)

* accept fully-qualified expert names in lora check

* ruff format

* refactor(bridge): dual-read member rewards (structured -> flat fallback)

Prefer state['member_rewards'][mid] (MultiAgentRubric contract).
Fall back to legacy flat metrics['reward/{mid}'] with one-time
deprecation warning per process. Structured key wins when both
present.
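
A minimal sketch of the dual-read (state/metrics shapes assumed):

    import warnings

    _warned_flat_rewards = False  # one-time deprecation warning per process

    def resolve_member_reward(state: dict, metrics: dict, mid: str):
        global _warned_flat_rewards
        structured = state.get("member_rewards")
        if structured is not None and mid in structured:
            return structured[mid]           # structured key wins when both present
        if not _warned_flat_rewards:
            warnings.warn("flat 'reward/<mid>' metric keys are deprecated",
                          DeprecationWarning)
            _warned_flat_rewards = True
        return metrics.get(f"reward/{mid}")  # legacy flat fallback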

* refactor(advantage): extend RAE baseline key to (task, example_id, role_id)

Partitions EMA baselines across envs — previously, two envs with
overlapping example_ids would contaminate each other's role-conditioned
baselines. 'task' sourced from MemberRollout['task'] (= env name).
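
A minimal sketch of the widened key (types assumed from the surrounding commits):

    # EMA baselines partitioned per env ("task"), example, and role, so two
    # envs with overlapping example_ids can no longer contaminate each other.
    RAEKey = tuple[str, int, str]  # (task, example_id, role_id)

    def rae_key(rollout: dict) -> RAEKey:
        return (rollout["task"], rollout["example_id"], rollout["role_id"])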

* test(rubrics): MultiAgentRubric contract + bridge dual-read + RAE task key

- contract: subclass populates member_rewards/member_metrics/episode_metrics
- score_group error boundary: KernelProtocolError in one rollout does not
  prevent scoring of other rollouts; defaults populated on failing state
- non-vf errors propagate (programming bugs escape loud)
- bridge prefers structured member_rewards, falls back to flat metrics
- RAE baselines partition by task (different envs do not contaminate)

* test(multi_agent_env): rollout, atomic commit, invariant, lineage cache

13 tests covering:
- init validation (empty/duplicate members, stray overrides)
- sequential rollout with correct member tagging + stop conditions
- priority ordering (error > schedule_exhausted > prompt_too_long)
- simultaneous slot atomic commit (all-or-none on mid-slot error)
- monotonic build_prompt invariant across a 4-slot rollout
- actor_overrides routing to per-member (client, model)
- lineage-scoped prefix match: A's second turn hits A's cache, not B's

* test(kernel): regression tests for native-think leak + quarantine

- test_parse_channels_strips_native_think_with_custom_tag: with pack
  configured think_tag='reason', native <think>secret</think> never
  reaches public_channel and is NOT promoted to private_channel.
- test_apply_action_quarantines_malformed_think_markup: malformed
  model output commits with parse_error flag instead of aborting;
  kernel-state violations (wrong actor) still raise.
- test_rollout_survives_benign_prose_with_bracket_words: 'I will
  <think> and answer' parses as quarantined, schedule still advances,
  peer member still gets to speak.

* test(kernel): assert exact whitespace contract in native-think strip test

Replace weak or-chain (pub == 'public  tail'.strip() or ...) with exact
assertion pub == 'public  tail'. Documents parse_channels' whitespace
contract: block excision preserves internal whitespace, outer strip()
only trims leading/trailing.

* fix: work around transformers lazy_load_kernel offline regression (#2276)

* fix(scheduler,bridge): narrow error catch + atomic reward schema

Scheduler:
  The blanket 'except Exception' in _process_finished_task swallowed
  every non-CancelledError — MemoryError, AttributeError, KeyError from
  dataset corruption, KernelProtocolError, OverlongPromptError — and
  converted them to silent sample loss. Hiding these during a migration
  is exactly the opposite of what we want. Narrowed to the two error
  classes verifiers.utils.async_utils.maybe_retry considers retryable:
  vf.InfraError (incl. TunnelError, SandboxError, BrowserSandboxError)
  and vf.InvalidModelResponseError (incl. EmptyModelResponseError).
  Everything else propagates loud.

Bridge:
  _resolve_member_reward worked per-member, which let a half-migrated
  rubric write structured for some members and flat for others on the
  same rollout, silently merging two schemas. Replaced with
  _resolve_reward_schema(members, ...) — atomic decision per rollout.
  If state['member_rewards'] is present it MUST cover every member;
  otherwise ValueError. Otherwise all members come from the legacy
  flat 'reward/{mid}' keys.

Tests:
  - test_bridge_partial_structured_rewards_raises (was
    test_bridge_structured_missing_member_falls_back) — inverts the
    semantic: partial coverage now raises instead of mixing.
  - test_bridge_flat_missing_member_is_none — legacy flat path still
    tolerates missing keys (preserves pre-migration semantic).

303/303 tests pass.
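
A minimal sketch of the narrowed catch (import path and the drop-and-refill helper are assumptions):

    import asyncio
    import verifiers as vf

    async def process_finished_task(task):
        try:
            return await task
        except asyncio.CancelledError:
            raise                            # never swallow cancellation
        except (vf.InfraError, vf.InvalidModelResponseError) as exc:
            drop_and_refill(exc)             # hypothetical retryable-path helper
        # anything else (MemoryError, KeyError, ...) propagates loud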

* test(multi_agent_env): TaskGroup cancellation + post-commit rollback

Two new tests covering the HIGH findings from round 2:

- test_simultaneous_slot_cancels_peer_on_first_failure: asserts peer
  actor never reaches its completion line when a sibling raises first
  (TaskGroup cancellation contract).
- test_simultaneous_slot_rolls_back_on_post_commit_hook_failure:
  asserts state["_kernel"] stays at the pre-slot snapshot and
  trajectory remains empty when on_step_committed raises mid-slot.

* test(debate_env): monotonic invariant + real-types e2e rollout

Adds two structural tests for Phase 5's DebateEnv refactor:

1. test_debate_env_build_prompt_monotonic_across_slots -- asserts that
   for each member, the slot_{N+1} prompt is a byte-equal extension of
   slot_N's prompt. The prefix-cache path in the token client depends
   on this, and breaking it silently turns an O(T) episode into O(T^2).

2. test_debate_env_end_to_end_real_types_rollout -- drives a full
   rollout + score on the production selfplay prompt pack with no mocks
   on core types (DebatePrompts, FieldSpec, DebateRubric). Only the
   client is faked. Verifies trajectory tagging, reward, and completion.

Also updates test_debate_complete_fires_when_schedule_exhausted to
expect the inherited 'schedule_exhausted' stop-condition name now that
DebateEnv inherits stop conditions from MultiAgentEnv.

* test(debate_env): migrate shim call sites, drop zombie consolidate tests

Mirror of the verifiers cleanup (c1ddf1d):
  * env.debate_complete(state)     -> env.schedule_exhausted(state)
  * env._resolve_actor(x)          -> env.resolve_actor(x)
  * delete _consolidate_messages import
  * delete test_consolidate_merges_contiguous_user_messages
  * delete test_consolidate_does_not_merge_system_messages

The two dropped tests asserted behavior that no longer runs in
production (build_prompt stopped calling the consolidator in the
monotonic refactor). 94 tests pass, was 96.

* vf bump (#2302)

* fix: clean stale rollouts and broadcasts on fresh runs (#2304)

Previously `clean_future_steps` only ran when resuming from a checkpoint,
so a fresh run started in an output_dir containing stale rollouts or
broadcasts from a previous run would consume them: the trainer would
train on stale data and the orchestrator would compute a negative async
level because it sees a trainer that is seemingly ahead of it.

Run the same cleanup from step 0 when training from scratch so these
artifacts are removed before training begins.

* test(maenv): regression tests for fold / positional round_index / strict pack validation

10 tests covering:
- fold_consecutive_user_messages: idempotence, SA tool no-op, tool-metadata
  preservation, multimodal content-list safety, merged-user metadata carry.
- DebateEnv.build_prompt end-to-end: folded rollout prompts produce a single
  trailing user msg that _is_valid_env_tail accepts; prefix byte-equality
  between slot-N cache and slot-N+1 prompt.
- DebateEnv positional round_index: sparse slot_ids (10, 20, 30, 40) render
  the same past-instruction text as contiguous (0, 1, 2, 3).
- DebatePrompts._validate: rejects round_index in system, phase in question,
  accepts turn-invariant templates even when user block references per-turn
  vars.

* test(maenv): drop hardcoded sys.path; tighten multimodal fold assertion

Auditor flagged:
- sys.path.insert with hardcoded /Users/joanvelja/... path — works only
  on the laptop, breaks CI. Dropped; the sibling-fork venv already has
  verifiers importable.
- unused `import yaml`. Dropped.
- test_fold_skips_multimodal_content_lists asserted only len(folded)==2,
  weak. Now asserts folded == msgs byte-for-byte and confirms the
  image_url structural part is preserved.

* test(maenv): 6 regression tests for AST validator + per-member num_rounds

AST validator bypass coverage (all were silent under the regex):
- {% if is_first_round %} statement-tag bypass
- {{ hints[round_index] }} index-access bypass
- {% set r = round_index %} set-directive bypass
- is_first_round variable (was missing from original list)

Per-member num_rounds:
- simultaneous schedule [AB, AB]: num_rounds == 2 per member (not 1)
- asymmetric schedule: A=3 / B=2 (not 5//2=2 for both)

* chore: bump vllm-router to v0.1.22 (#2292)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(multi_actor_advantage): use defaultdict for per-key aggregation

Replace manual .get-or-default pattern in key_sums/key_counts with
defaultdict. Iterate via .items() in the update loop instead of
re-indexing by key.
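
A minimal sketch of the pattern (variable names from the commit; the input mapping is assumed):

    from collections import defaultdict

    key_sums = defaultdict(float)
    key_counts = defaultdict(int)

    for key, reward in rewards_by_key.items():  # rewards_by_key is assumed
        key_sums[key] += reward
        key_counts[key] += 1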

* refactor(bridge): drop flat-metrics fallback, require member_rewards

Pairs with verifiers commit removing member_metrics/episode_metrics.
Now that every rubric must write state['member_rewards'], the bridge's
legacy fallback to metrics['reward/{mid}'] (and its one-time deprecation
warn, module global, helper layer) is dead.

_resolve_reward_schema → _resolve_member_rewards: one-shot lookup,
raises on absence or partial coverage. No schema decision, no fallback.

Test migration:
- test_multi_agent_rubric: drop member_metrics/episode_metrics
  assertions; contract is now just member_rewards.
- test_multi_actor_bridge: _make_rollout_output uses member_rewards
  parameter (was metrics with reward/{mid}). Dropped three legacy-
  fallback tests (falls-back-to-flat, flat-missing-is-None, prefers-
  over-flat) → replaced with one partial-coverage-raises contract
  test and one missing-member-rewards-raises test.
- test_debate_env full-pipeline test patches member_rewards['J'] for
  the post-rollout injected judge step.

* feat(buffer,bridge): accept int | str example_id end-to-end

Buffer's isinstance check + example_buffer type signature widen to
int | str. The dict keys int | str without any code change — Python
hashes both cleanly.

Bridge MemberRollout.example_id + _validated_example_id widen to
int | str (previously int-only with a gate rejecting str). The
gate-and-revert dance from earlier in this PR goes away now that the
three layers (dataset, buffer, bridge) are consistent.

Test migration: test_str_example_id_rejected_until_dataset_and_buffer_support_it → test_str_example_id_flows_through_bridge. The rejection
semantic is now a positive test for propagation.

Note: prime-rl venv is linux-only per lockfile, so the buffer-side
torch-dependent integration tests can't run on Darwin. The type widen
is structurally verified: isinstance check accepts both; dict keys on
both; bridge round-trip test on a str id passes end-to-end through the
non-torch layer.

* fix: check rollout error before empty trajectory in scheduler (#2308)

When `verifiers` CliAgentEnv catches an agent crash pre-LLM-call, it
sets `state["error"]` but the trajectory stays `[]` because the agent
never produced any messages. The previous branch order fired the
"Empty trajectory" warning first and dropped the detailed AgentError
diagnostic. Swap the branches so error-bearing rollouts surface
"Rollout error ...: {error_chain_repr}" instead.

Related: PrimeIntellect-ai/verifiers#1127, #1130

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
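
A minimal sketch of the swapped branch order (logger and state shapes assumed):

    if state.get("error") is not None:
        # Surface the detailed diagnostic first, even when the agent
        # crashed before producing any messages.
        logger.warning(f"Rollout error ...: {state['error']!r}")
    elif not state["trajectory"]:
        logger.warning("Empty trajectory")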

* fix(eval): restore per-rollout isolation + correct total_turns fallback

Colleague review flagged three regressions in the initial MA commit:

P1 pyproject verifiers source (already reverted to git pin).

P2 eval failure semantics (vf_utils.py): the earlier change dropped
_get_eval_inputs flattening and passed rollouts_per_example=K into
generate(). That routes through env.run_group(), which uses
asyncio.gather() WITHOUT return_exceptions=True and retries the whole
K-group on any raise. One transient failure past max_retries dropped
every rollout for that example, biasing pass@k / avg@k toward examples
that never flake.

Verified inertness before reverting: DebateRubric / MultiAgentRubric
declare no GroupRewardFunc; multi_actor_eval groups on
base_example_id post-hoc. The change enabled no active consumer, so
reverting loses nothing currently used.

Revert: keep _get_eval_inputs flattening upfront, pass
rollouts_per_example=1 so each rollout is its own run_group call.
Comment documents the trade-off for future comparative rubrics.

P3 total_turns fallback (multi_actor_eval.py): len(r.members[0].trajectory)
counted one participant's steps. An alternating A/B schedule
under-reported by factor 2; A/B/J by ≈3. Fixed to
sum(len(m.trajectory) for m in r.members).

* fix: serialize env server spawn to avoid port race (#2310)

get_free_port() only holds the port until it returns, so parallel
env spawns under asyncio.gather could hand the same port to two
children — the loser died with EADDRINUSE. Serializing start()
and awaiting wait_for_server_startup() between envs ensures each
port is bound before the next one is picked.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
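
A minimal sketch of the serialized spawn (helper names from the commit; the loop shape is an assumption):

    for env in envs:
        port = get_free_port()              # free only until this returns
        env.start(port=port)
        await wait_for_server_startup(env)  # port is bound before the next pick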

* Add FA4 (flash_attn.cute) support to ring attention, enabling context (#2307)

parallel training with FA4 kernels. Mirrors the FA3 ring attention
pattern (all-gather K/V, compute per GQA stride, reduce-scatter grads)
using FA4 low-level _flash_attn_fwd/_flash_attn_bwd.

Changes:
- ring_attn.py: FA4 forward/backward wrappers, _RingFA4Varlen autograd
  Function, ring_fa4_varlen_func public API
- attn.py: route FA4 to ring_fa4_varlen_func in substitute_ring_attn
- trainer.py: allow CP with fa4 (requires model.impl='custom')

* Fix Prime monitor public API flow (#2205)

* Use bearer auth for Prime monitor uploads

* Fix Prime monitor presign and finalize flow

* Sanitize non-finite Prime monitor payloads

* Simplify Prime monitor payload normalization

* Simplify Prime monitor public API contract

* Simplify public presign response parsing

* Simplify non-finite payload sanitization

* Inline public presign response parsing

* Inline non-finite payload sanitization logic

* Refine Prime monitor JSON sanitization

* Address review: inline auth headers and simplify sanitize

- Remove _api_headers() helper; store self._headers once in __init__
- Always sanitize payloads; drop silent try/except and log only when values are dropped
- Remove prime_cli sys.modules mocking from tests (real dep is installed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: sami jaghouar <sami@primeintellect.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: remove prefix-cache-salt and reset-prefix-cache config flags (#2314)

* chore: remove prefix-cache-salt and reset-prefix-cache config flags

Hardcode the defaults: always set cache_salt on inference requests
(keyed by ckpt_step) and never reset the prefix cache after weight or
LoRA updates. The salt alone is sufficient to invalidate stale KV
states across policy updates, so the reset path is redundant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
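
A minimal sketch of the hardcoded salting (request shape assumed):

    # Salt every inference request by the checkpoint step; KV entries cached
    # under an older policy can then never prefix-match new requests.
    extra_body = {"cache_salt": f"ckpt-{ckpt_step}"}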

* chore: keep empty experimental sub-configs as extension points

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump verifiers pin to a036fce (includes v0.1.12 sync)

Upstream verifiers main was merged into our feat/debate-env branch
(github 'Sync fork' → merge main). Commit a036fce on
joanvelja/verifiers brings in v0.1.12:
- TITO tool-shape dummy assistant fix (stitcher defensive)
- json_logging propagation to env workers
- swebench root-logger hijack fix
- tomllib/tomli py3.10 guard
- CliAgentEnv dead-tunnel fix + AgentError double-wrap fix
- NeMoRLChatCompletionsClient available as actor_overrides target
- composable Task/Agent/Environment experimental (orthogonal to MA)

332/332 multi-actor tests green against new pin. No MA-path changes
required — upstream surfaces (RLM, CliAgent, composable, CLI eval)
are orthogonal to our MultiAgentEnv stack.

* refactor(orchestrator): MARScore bridge + P0 fixes + dead Path-B removal

Pairs with verifiers e04c8f5 (MARScore + MemberScore + factory rewiring).
The bridge now reads the typed ``state["mar_score"]`` payload directly —
dropped 5-key dict plumbing, schema drift is structurally impossible.

Bridge
  - multi_actor_bridge: rewrite rollout_to_member_rollouts to read
    output["mar_score"] (verifiers.types.MARScore). Drops
    _resolve_member_rewards, _validated_example_id, _member_to_rollout,
    and the dead episodes_to_member_rollouts (Path-B push protocol).
    Auto-coerces dict -> MARScore via model_validate, so the wire format
    (in-memory object vs. JSON-round-tripped dict) is transparent.

P0-2 quarantine masking
  - trajectories.interleave_rollout: check
    step["extras"]["parse_error"] and mask completion tokens (both
    make_sample + extend_sample paths). Previously only the global
    output["error"] gated masking, leaking malformed model tokens into
    training despite the kernel's per-utterance quarantine.

Scheduler widen
  - scheduler: TimeoutError added to the retryable-transient catch
    alongside (vf.InfraError, vf.InvalidModelResponseError). The env
    server client raises built-in TimeoutError on recovery timeouts;
    those stalls should follow the same drop-and-refill path.
  - test_scheduler: regression test asserting a mid-group TimeoutError
    is dropped, the group state is cleaned, and the remaining rollouts
    proceed.

Path-B graveyard (zero production callers, confirmed by grep across
both repos)
  - Delete multi_actor.py (197 LOC) — run_episode / run_episode_group
    consumer of the MultiActorEnv Protocol. No implementation of the
    Protocol exists in either tree.
  - Delete multi_actor_eval.py (135 LOC) — evaluate_multi_actor_episodes
    consumes EpisodeResult (Path-B). Duplicates eval_utils._pass_at_k.
  - Retain multi_actor_advantage.py (RAE baselines). Path-B-tagged but
    reusable: MemberRollout-compatible, per-(task, example_id, role_id)
    partitioning — the obvious advantage path for MA training wiring.
    Annotation widened to tuple[str, int | str, str] to match the
    MemberRollout.example_id int|str contract end-to-end.

Tests
  - test_multi_actor_bridge: fixtures rebuilt to construct RolloutOutput
    via the real state_to_output -> JSON round trip. Closes the test-
    fabrication hole that hid the original P0 (state["member_rewards"]
    silently dropped at serialization).
  - test_multi_agent_rubric: updated for MARScore contract; adds
    coverage that base rubric does NOT overwrite subclass's partial
    mar_score on vf.Error.
  - test_marscore_stress: 33 adversarial property tests across 10
    sections (schema invariants, round-trip fidelity, SA fallback,
    dict/object bridge input, P0-1 ExceptionGroup flattening, P0-2
    quarantine propagation, P0-4 fork/merge isolation, errored-rollout
    round-trip, schema enforcement, projection invariants).
  - test_multi_actor_advantage: dedicated suite for RAE (cold start,
    EMA, per-role/example/task baseline independence, ordering
    invariance, repeated-key mean update, str example_id).
  - test_debate_env / test_debate_prompts: migrated assertions to the
    new contract via inline _views helper (legacy-shape projection of
    mar_score for backwards test readability) and the
    DebatePrompts.__post_init__ verdict-token collision check now fires
    at pack construction (was in load_environment).

pyproject: bump verifiers pin a036fce -> e04c8f5.

* refactor(orchestrator): consume verifiers multi-agent bridge

* refactor: unify actor→agent naming across orchestrator multi-agent modules

Paired with verifiers 638504d (same rename + build_prompt decomposition +
arch doc). Zero behavior change on the prime-rl side — mechanical
consumer-side rename.

- Rename multi_actor_advantage.py -> multi_agent_advantage.py (git mv)
- Rename multi_actor_bridge.py -> multi_agent_bridge.py (git mv; still a
  thin compat shim that re-exports verifiers' rollout_to_member_rollouts
  and MemberRollout)
- Rename test_multi_actor_* -> test_multi_agent_* (git mv)
- Update imports: verifiers.envs.multi_actor_kernel -> multi_agent_kernel
- Update field access: slot.actors -> slot.agents
- Update identifier names: actor_overrides -> agent_overrides etc.
- "member"/member_id/member_rewards unchanged — distinct roster-level concept
- Bump verifiers pin: e04c8f5 -> 638504d

331 multi-agent tests pass unchanged.

* feat: drop filtered rollouts instead of masking (#2277)

* feat: drop filtered rollouts from training batch instead of masking

Previously, enforced filters zeroed the completion_mask on detected
rollouts but still sent them through the entire training pipeline.
This wastes compute on samples that contribute nothing to the loss.

Now, `apply_filters` returns the subset of rollouts that should be
sent to the trainer. Enforced-detected rollouts are excluded before
pretokenization, VLM cache building, and sample construction.

The trainer handles the resulting empty batches ("phantom steps") by
skipping forward/backward and logging `data/is_empty_batch`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: retry empty filtered batches instead of passing them to trainer

Keep the invariant that the trainer only receives non-empty batches. If
all rollouts are filtered out, regenerate the batch (up to 3 retries)
and crash the orchestrator on sustained failure. Warn at <=10% trainable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
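
A minimal sketch of the retry loop (the attempt cap and is_filtered flag take their final names from later commits in this stack; generate_batch is assumed):

    MAX_EMPTY_BATCH_ATTEMPTS = 3

    for attempt in range(1, MAX_EMPTY_BATCH_ATTEMPTS + 1):
        rollouts = await generate_batch()
        trainable = [r for r in rollouts if not r["is_filtered"]]
        if trainable:
            break
        logger.warning(f"Attempt {attempt}/{MAX_EMPTY_BATCH_ATTEMPTS} produced no trainable rollouts - retrying")
    else:
        raise RuntimeError("No trainable rollouts after repeated regeneration")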

* chore: drop redundant num_rollouts guard

The retry loop only breaks when len(filtered_rollouts) > 0, which implies
num_rollouts > 0, so the guard is unreachable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: expand low-trainable-ratio warning with env review hint

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: drop empty-df guard and inline filtered metrics

filtered_rollouts is guaranteed non-empty after the retry loop, so the
empty-df branch is unreachable and the intermediate locals add no value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: hoist MAX_EMPTY_BATCH_RETRIES to module scope

Also rename the loop var and warning message to "retry N/MAX" so the
counter excludes the initial attempt and reads less ambiguously.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: adjust log style in filter retry warnings

Drop trailing periods and replace ";" with " - " as clause separator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: clarify low-trainable-ratio warning hint

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: compute metrics over all rollouts, drop only from trainer

Metric logging reverts to main's semantics: all rollouts contribute to
prefill_len, decode_len, samples_per_rollout, and results_df. Filtered
rollouts are still pretokenized and interleaved, but their samples are
simply not added to train_examples. Also inline the generate_batch
coroutine since it is awaited immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: move filter flags to rollout["filter"] + is_filtered

Per-filter detection booleans now live under rollout["filter"], and a
top-level rollout["is_filtered"] captures whether any enforcing filter
triggered. The orchestrator uses is_filtered directly as the keep gate
(no more id() mapping). apply_filters no longer returns filtered_rollouts
- the in-place flags are the single source of truth. Also unbound-var
fix for retry-loop locals, and per-env filter/<env>/<flag>_rate logging
that mirrors the metrics logging pattern.

Both new fields are serialized to train_rollouts.jsonl via save_rollouts,
which already writes all top-level rollout keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: cap to 3 total batch-generation attempts, not 3 retries

Rename MAX_EMPTY_BATCH_RETRIES to MAX_EMPTY_BATCH_ATTEMPTS and have the
loop run exactly that many times. Warning now reports the attempt that
just failed ("Attempt N/3 ... retrying").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: log error line before raising on exhausted retries

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: align filter metric key names with per-env logging

Rename filter/total_detected_rate -> filter/detected_rate and
filter/total_enforced_rate -> filter/is_filtered_rate so the overall
keys mirror the per-env filter/<env>/is_filtered_rate naming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: unify filter logging under filter/{all,<env>}/{<filter>,is_filtered}

Move is_filtered into results_df so it can be aggregated per-env like
is_truncated. filter_df now holds just per-filter detection booleans.
apply_filters no longer returns an aggregate metrics dict - the
orchestrator derives the rates uniformly across the "all" and per-env
scopes, with symmetric key naming and no _rate/_count suffixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rename rollout["filter"] to rollout["filters"] + log keys

Aligns with the plural configs list and the rollout-level "filters"
namespace. Log keys change from filter/{all,<env>}/... to
filters/{all,<env>}/....

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: self-evict orchestrator when batches carry no learning signal

Write control/evicted.txt before raising, so the multi-run manager
skips the run on rediscovery instead of treating it as a hard crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* update dependency (#2317)

Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>

* test(maenv): update fold contract tests to typed Messages

verifiers' fold_consecutive_user_messages narrowed from
(Messages | list[dict]) → list[dict]
to:
Messages → Messages
— typed in, typed out, with model_copy preserving extras (e.g.
OpenAI `name` field under CustomBaseModel extra="allow"). Tests
updated to construct typed UserMessage / SystemMessage /
AssistantMessage / ToolMessage inputs and assert via attribute
access (m.content, m.role) instead of dict indexing.

End-to-end roundtrip test simplified: _is_valid_env_tail's _get_role
helper accepts both attr and key access, so we pass typed messages
straight through without model_dump.

* chore: rename deprecated orchestrator config keys (#2327)

Rename '[orchestrator.sampling]' -> '[orchestrator.train.sampling]',
'[[orchestrator.env]]' -> '[[orchestrator.train.env]]', and
'max_tokens' -> 'max_completion_tokens' across all configs to remove
reliance on the deprecated auto-translation.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(multi_agent): align bridge/advantage to verifiers α-cut API

Bumps verifiers pin from 638504d → b723fda. Verifiers' α-cut deleted
role_id as a redundant duplicate of member_id (the dual labeling poisoned
RAE baseline buckets when MemberScore.role_id and MemberRollout.role_id
diverged on errored rollouts). prime-rl now follows the cut end-to-end.

API alignment:
* MemberScore / MemberRollout / TrajectoryStep extras drop role_id
* DebateEnv constructor drops role_for_agent kwarg (pack prompts key by
  member_id directly)
* DebateRubric kwarg truth_role → truth_member
* rollout_to_member_rollouts(output) — env_name positional dropped;
  bridge no longer overwrites output["task"]
* MARScore.to_wandb_flat() → to_metrics_flat()
* Errored MARScore episode_metrics is now {"errored_rollout": 1.0} only;
  error_type / error_phase moved to MARScore.episode_error
* MultiAgentEnv._flatten_exception_group removed (asyncio.TaskGroup
  replaced by asyncio.wait — no flattening needed)
* DebateRubric._count_parse_errors removed; counting now lives in
  member_snapshot which returns parse_errors as part of a per-member dict
* DebatePrompts.wrap_opponent / build_context kwargs viewer_role/role_id
  → viewer_id/member_id
* DebateRubric.judge_client lazy: construction succeeds without it;
  verdict() raises at score time. _grade/_match collapsed into verdict()
  raising vf.Error (not RuntimeError)

src changes:
* multi_agent_advantage.RAEKey docstring + key construction:
  (task, example_id, role_id) → (task, example_id, member_id)

Test changes (updates, no deletions of behavior coverage):
* Member naming restructured: env-rollout tests use members=
  ["prover","verifier"]; rubric/score-time tests use ["debater_a",
  "debater_b","judge"] so member_ids match prompt-pack keys directly
* Stale-behavior tests repurposed to assert the new fail-loud /
  captured-error / no-overwrite contracts (e.g.
  test_round_trip_preserves_role_id → test_round_trip_preserves_member_id_assignment)
* test_bridge_raises_on_missing_sampling_args → repurposed to assert
  that omitted temperature defaults to 1.0 (sampling_args is now always
  projected as {} by state_to_output)
* Loser zero_sum_reward asserted as -1.0 (was 0.0 — current
  zero_sum_reward is winner+1 / loser-1 / judge 0 / tie 0)
* Tests covering removed eager judge_client validation gates flipped to
  assert score-time verdict() failure instead

330 / 330 collectable orchestrator unit tests pass.

* fix(multi_agent_advantage): SPIRAL Alg.1 ordering — update EMA before subtract

Previous code did subtract-then-update with per-batch mean aggregation:
  for τ in B:  A(τ) = R(τ) - b
  b ← α·b + (1-α)·mean({R(τ)})

SPIRAL Alg.1 (arxiv:2506.24119, lines 18-22, verbatim):
  for (τ, G_i) ∈ B do
    for p ∈ {0, 1} do
      b_{G_i,p} ← α·b_{G_i,p} + (1 - α)·R_p(τ)         [line 20]
      A_{G_i,p}(τ) ← R_p(τ) - b_{G_i,p}                  [line 21]

Per-trajectory, update-then-subtract. Each rollout's advantage is
computed against the baseline that has just absorbed its own reward;
sequential rollouts sharing a key compound through the EMA recursion
rather than collapsing to a single mean update.
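
A minimal sketch of the corrected ordering (cold-start baseline of 0.0 assumed; this reproduces the numbers below):

    def rae_advantage(baselines: dict, key, reward: float,
                      momentum: float = 0.9) -> float:
        b = baselines.get(key, 0.0)
        b = momentum * b + (1.0 - momentum) * reward  # Alg.1 line 20: update first
        baselines[key] = b
        return reward - b                             # Alg.1 line 21: then subtract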

Numerical impact (cold-start, momentum=0.9):
                            OLD       NEW
  single R=1.0          A=1.0     A=0.9        (=α·R)
  rep-key [1.0, 0.0]    A=[1, 0]  A=[0.5, -0.25]   (mom=0.5)
  end baseline          0.25      0.25         (same in this case)

For sequential batches the two schemes converge asymptotically: at
α=0.9, after 20 rounds of R=1, both end at baseline≈0.878, with OLD's
20th advantage ≈0.135 (it subtracts the pre-update baseline) vs NEW's
≈0.122. The within-batch ordering invariant the previous
implementation relied on no longer holds: see
test_within_batch_ordering_compounds_per_trajectory.

Tests updated (5):
* test_cold_start_advantage_equals_reward → ..._is_reward_minus_post_update_baseline
  (asserts α·R = 0.9 instead of R = 1.0)
* test_second_batch_uses_updated_baseline (asserts [0.9, 0.81]
  instead of [1.0, 0.9])
* test_within_batch_ordering_invariant → ..._compounds_per_trajectory
  (asserts that order DOES matter — distinct end baselines)
* test_repeated_key_in_batch_uses_mean_for_baseline_update →
  ..._compounds_per_trajectory (asserts per-trajectory recursion, no
  mean aggregation)
* test_zero_reward_from_errored_rollout_keys_correctly (A=-0.35
  instead of -0.7 — baseline is updated before the subtract)

Other 8 tests unchanged: cold-start single-key, distinct keys (per-
member, per-example, per-task), str example_id, none reward, empty
batch, degenerate group, baselines_update_after_batch.

13 / 13 advantage tests pass; 330 / 330 collectable orchestrator tests
pass.

* feat(ckpt): persist RAEState alongside progress + buffer

CheckpointManager.save / load now accept an optional rae_state: RAEState
| None. When set, the EMA baselines + momentum are serialized to
rae_state.pt next to progress.pt; when omitted, no file is written. On
load with rae_state set but file missing, we FileNotFoundError loudly
rather than silently cold-starting — discarding EMA history mid-run is
the kind of "training looks fine but has invisibly worse variance" bug
the no-silent-fallbacks rule exists to prevent.

Single-agent runs are unaffected: callers that pass rae_state=None (the
default) get the original save/load behavior with no rae_state.pt
written or expected.

Test: round-trip + missing-file + omit-on-save (3 cases). Skipped on
Darwin where torch isn't importable from the verifiers venv we run from
— runs cleanly on Linux with prime-rl's full deps.

* feat(orchestrator): route multi-agent rollouts through RAE per-member path

Detects MultiAgentRubric on the env group at startup and branches the
per-step training pipeline:

  episode rollout (1 per inference call)
      ├─[single-agent]→ compute_advantages (GRPO) → 1 training unit
      └─[multi-agent]──→ rollout_to_member_rollouts (verifiers bridge)
                          ↓
                         drop judge member (config.rae.drop_judge=True default)
                          ↓
                         compute_rae_advantages (SPIRAL Alg.1)
                          ↓
                         N training units (one per member)

Both paths feed into the same downstream pretokenize → interleave_rollout
→ TrainingSample assignment. Per-rollout metrics (results_df) preserve
single-agent shape — per-unit token counts fold back via a
``rollout_to_unit_idxs`` mapping.

Guardrails:
* mixed MA + single-agent envs in one EnvGroup → NotImplementedError
  (different per-step branching, defer hybrid until a real use case shows)
* MA + VLM → NotImplementedError (image cache key fan-out unimplemented)
* RAE state lifecycle: instantiate at startup, persist via ckpt.save,
  restore via ckpt.load on resume (rae_state.pt round-trip)
* Judge filter is opt-out (config.rae.drop_judge=True default) — judge
  has reward=0 by zero_sum_reward construction, training those tokens
  burns gradient compute on policy-neutral noise

New config: ``rae: RAEConfig`` with ``momentum`` (Alg.1 α decay, default
0.9) and ``drop_judge`` (default True). Single-agent runs ignore it.

New helper: ``fan_out_for_multi_agent(rollouts, drop_judge) -> (units,
rollout_to_unit_idxs)`` extracted from the orchestrator inline so the
fan-out logic is independently testable. 5 fan-out unit tests cover
judge-drop, judge-keep, multi-rollout index mapping, end-to-end pipe
into compute_rae_advantages, and empty-batch.

Stage 3 follow-ups (separate PRs, not blockers for this wiring):
* verifiers-side ``agent_overrides_resolver`` for per-episode learner
  seat assignment (gates first training run)
* prime-rl filter to keep only ``member_id == row["learner_seat"]``
  units (depends on the verifiers PR landing)

335 / 335 collectable orchestrator tests pass. Wiring change: 174 LOC
(orchestrator.py: 122, advantage helper: 33, config: 34, minus 15
removed lines) — well under the briefing's 300-LOC bail-out.

* fix(orchestrator): bind use_rae before VLM gate; persist RAEState in final ckpt

Two bugs caught in Codex review of the multi-agent wiring:

P1 (BLOCKER, every launch): the ``if use_rae and is_vlm`` guard at
line ~146 read ``use_rae`` before the MA detection block at line ~220
assigned it. Python's local-scope rule promotes ``use_rae`` to local
throughout the function as soon as ANY assignment exists, so the
earlier read raised ``UnboundLocalError`` on EVERY orchestrate()
invocation — single-agent and multi-agent alike. Moved the VLM+MA
gate inside the ``if use_rae:`` block where ``use_rae`` is bound.
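
The scoping rule behind P1 reproduces in a few lines (self-contained illustration, not the orchestrator code):

    x = "global"

    def f():
        print(x)     # raises UnboundLocalError: the assignment below makes
        x = "local"  # x local for the WHOLE function body, including the read above

    f()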

P2 (data loss on resume): the final ``ckpt_manager.save`` after the
loop didn't pass ``rae_state=``. Multi-agent runs that finished on a
non-interval step wrote a checkpoint without ``rae_state.pt``;
resume from that checkpoint then hit the load-side
FileNotFoundError that ckpt.py raises by design (no silent
cold-start). Added the kwarg.

Static AST invariants test added — three properties caught both bugs
without needing the heavy orchestrate harness:

* use_rae: first Load (by source line) ≥ first Store
* rae_state: same invariant
* every ``ckpt_manager.save / load`` call passes ``rae_state=``

These trigger on the bytecode shape, not behavior, so they catch the
class of bug at parse time. ``ast.walk`` is BFS, not document-order,
so the test takes ``min`` of all line numbers per ctx rather than
``first encountered`` — initially passed P1 spuriously because the
deeper Load node was visited later than the shallower Store node.

339 collectable orchestrator tests pass + 1 skipped (torch-gated).
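
A minimal sketch of the first-Load-vs-first-Store invariant (test-harness shape assumed; note the min() over line numbers per ctx, per the caveat above):

    import ast

    def first_load_and_store_lines(src: str, name: str) -> tuple[int, int]:
        names = [n for n in ast.walk(ast.parse(src))
                 if isinstance(n, ast.Name) and n.id == name]
        first_load = min(n.lineno for n in names if isinstance(n.ctx, ast.Load))
        first_store = min(n.lineno for n in names if isinstance(n.ctx, ast.Store))
        return first_load, first_store

    def assert_bound_before_read(src: str, name: str) -> None:
        first_load, first_store = first_load_and_store_lines(src, name)
        assert first_load >= first_store, f"{name} is read before it is assigned"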

* refactor(advantage): unify [rae] into [advantage] union; split [multi_agent] for routing

Surfaces the orthogonality of pipeline stages that the previous shape
conflated. RAE is a baseline-subtraction layer (stage 3); MA fan-out is
routing (stage 2); loss is a separate function in the trainer (stage 5).
The previous ``[rae]`` block at the top level made it look like RAE was
a coupled "MA path" — it isn't. RAE composes with any loss; you can run
SPIRAL EMA + asymmetric IPO clip + length-shaped reward independently.

Config surface (was → is):

  [rae]                       [advantage]
    momentum                    type = "ema_per_member"  ← discriminator
    drop_judge                  momentum

                              [multi_agent]
                                drop_judge

The advantage discriminated union now has three variants:

  type = "default"          GRPO group-mean baseline (single-agent only)
  type = "ema_per_member"   SPIRAL Alg.1 EMA per (task, ex, member_id)
  type = "custom"           import_path + kwargs

Cross-validation at orchestrator startup (pydantic can't see the rubric):
* MA env + type="default" → ValueError (samples_per_problem grouping
  ambiguous after fan-out)
* SA env + type="ema_per_member" → ValueError (member_id key meaningless)
* MA env + type="custom" → permitted (user's responsibility)

Orchestrator changes:
* ``use_rae`` → ``is_ma`` (gates stage 2, not stage 3)
* ``rae_state`` → ``advantage_state`` (generic — placeholder for any
  stateful estimator we add later; currently only RAEState lives there)
* Per-step branching: stage 2 (fan-out) is independent of stage 3
  (advantage). The dispatch ``if advantage_type == "ema_per_member"``
  picks the per-unit estimator vs the flat-rewards GRPO/custom path.
* drop_judge moved from ``config.rae.drop_judge`` to
  ``config.multi_agent.drop_judge`` — it controls fan-out filtering, not
  baseline computation.

Static invariants test refactored to a parametrizable helper; added
checks for ``advantage_type`` and ``advantage_state`` to catch the same
class of UnboundLocalError that bit ``use_rae`` (P1 in commit 1e013eee0).

Net change: 340 / 340 tests pass + 1 skipped. No behavior change for
single-agent runs; multi-agent runs that previously used ``[rae]`` need
``[advantage] type = "ema_per_member"`` + ``[multi_agent]`` instead.
Greenfield repo, no compat shim.
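
For a multi-agent run, the migration might look like this in TOML (values illustrative):

    [advantage]
    type = "ema_per_member"  # discriminator; SPIRAL Alg.1 EMA baseline
    momentum = 0.9

    [multi_agent]
    drop_judge = true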

* feat(slurm): cleanup stale node-local state before launch (#2331)

* feat(slurm): cleanup stale node-local state before launch

Add a pre-workload srun step to the multi-node RL, multi-node SFT and
inference sbatch templates. It runs once per node and:

- kills orphan python/torchrun/vllm/prime_rl processes left over from a
  prior job that wedged after scancel (SLURM doesn't always reap cleanly
  when a job sits in CG for hours)
- removes stale vLLM and torch IPC state under /dev/shm/vllm-*,
  /tmp/vllm-*, /tmp/torch-*, /tmp/torchelastic_*

Without this, decode engines on previously-used nodes can hang at
"Waiting for READY message from DP Coordinator" because the new vLLM
process finds a stale /dev/shm segment or port holder from the dead run.
Symptom we hit: a fresh job timing out after 1800s because 4 decode
engines never became READY; a manual pdsh cleanup of the same nodes
fixed it immediately.

Each node prints one line (hostname, residual proc count, total GPU
memory in use) so the sbatch log shows the nodes came up clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(slurm): explicitly cover vllm-router in cleanup

Address review feedback: add vllm-router to the pkill list and the
procs-count regex so the intent is explicit, even though the broader
"vllm" patterns already match it as a substring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(slurm): also kill prctl-named vllm::router workers

pkill -f only matches the command line, so the vllm router's worker
processes — which set their kernel process name (comm) to "vllm::router"
via prctl but keep a different cmdline — slip through. Add process-name
pkill for "vllm" and "vllm::.*" to catch them.

Also broaden the post-cleanup procs count to look at both comm and args
(ps -eo comm,args) so we see these if any survive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: add conservative testing guidelines to AGENTS.md (#2330)

* configs: add gpqa_{rlvr,debate,consultancy} recipes

Three protocol comparisons on the same dataset (GPQA Diamond), same
model size (Qwen3-4B), same eval — what changes is where the reward
signal comes from:

  recipe                  reward source                  advantage
  ──────                  ─────────────                  ─────────
  gpqa_rlvr/rl.toml       verifier (exact letter match)  default GRPO
  gpqa_debate/            judge (winner-take-all)        ema_per_member
    rl_selfplay.toml                                     (SPIRAL Alg.1)
  gpqa_consultancy/       judge (picks assigned answer)  default GRPO
    rl.toml

The three are designed for direct A/B comparison: identical model,
batch size, sampling temperature, eval cadence. The diff is one
[advantage] block (or its absence) and the [[orchestrator.env]] id.

Status:
* gpqa_debate.rl_selfplay: works today against existing
  verifiers/environments/gpqa_debate package
* gpqa_rlvr + gpqa_consultancy: require new env packages in
  verifiers (sketches in environments/gpqa_rlvr and
  environments/gpqa_consultancy on a paired commit there)

Configs/ is informational per the README; not test-validated.

* test(debate_env): align packs with new schedule×prompts coverage check

verifiers commit 44f875e1 added an init-time cross-check on DebateEnv:
every (member_id, phase) in a StaticSchedule must have a matching
template in the prompts pack (system / question / user[member][phase]
or user[member]['default'] fallback). Several existing tests built
intentionally-incomplete packs and relied on the silent-no-instruction
failure mode the check now rejects.

Updates:
* DEBATE_PROMPTS top-level fixture: add opaque-label aliases (A, B, X,
  Y) for kernel-level cross-check tests that exercise members=
  validation against prover/verifier-keyed packs, and a 'default' user
  phase for prover/verifier so phase-specific schedule overrides
  (simultaneous etc.) don't trigger the new check.
* _make_think_prompts: add 'default' user phase fallback per member —
  these tests are about think-visibility / format_history, not
  instruction rendering.
* _open_ended_prompts / _judgeless_prompts: add judge keys (system +
  question + user.final). The "judgeless" name refers to the absence
  of a judges= dict, not the absence of a judge participant — the
  canonical _SCHEDULE_SLOTS *does* schedule a judge agent.
* _make_field_prompts: add verifier user templates + 'default' phase
  fallbacks so field-extraction tests work with any schedule.
* test_format_history_attributes_both_debaters_distinctly: add
  per-member default user templates (test is about wrap-template
  attribution, not user-instruction rendering).
* test_num_rounds_is_per_member_under_asymmetric_schedule: replace
  phase 'closing' (not in selfplay.yaml pack) with 'critique' — this
  test asserts on slot counts per member, not phase semantics.

340 / 340 + 1 skipped collectable orchestrator tests pass against the
new verifiers HEAD.

* chore: bump verifiers pin b723fda → 42a965e

Captures the two fork PRs that just landed on joanvelja/verifiers main:

  f4de712e feat(envs): add gpqa_rlvr (single-agent RLVR) + gpqa_consultancy
  78533ea7 fix(debate): validate effective prompt instruction coverage
          (the schedule×prompts init-time check)
  42a965e3 Merge GPQA baseline environments (HEAD)

Both were authored in this PR's branch stack (companion verifiers-side
commits). This final bump on the prime-rl branch makes the MA wiring,
new configs, and new env packages depend on a reproducible upstream
SHA rather than a moving HEAD.

Re-validated: 340 orchestrator unit tests pass + 1 skipped (torch-gated
ckpt round-trip) against the new verifiers HEAD via the verifiers venv
with prime-rl installed editable + --noconftest. No behavior change.

* chore(tmp): zebra pass@N headroom probe for Isambard

vLLM pass@{1,8} probe on Qwen3-4B-Instruct over 3x3/4x4 zebra buckets,
with Slurm wrapper and format-sanity sample. Parquet stays local.

* chore: signpost LoRA-self vs base pre-flight smoke for first GPU run

Three-layer signpost so the smoke is unmissable when the next session
loads on a GPU for the first learner-vs-fixed debate training run in
the LoRA-self topology (single vLLM hosting learner adapter + base).

  1. skills/preflight-lora-smoke/SKILL.md
     Auto-surfaces to agents working on "LoRA", "external opponent",
     "first GPU run", "enable_lora", "load_lora_adapter" contexts.
     Documents the three failure modes the web search turned up on
     vLLM 0.19 and how to interpret probe failures.

  2. scripts/preflight_lora_smoke.py
     Executable, ~200 LOC, three probes with PASS/FAIL output:
       - mixed-batch correctness (base and adapter coexist in one batch)
       - hot-swap idempotence (the #18372 probe: 3rd+ swap dropping)
       - per-request perf delta on LoRA-enabled server (#10898 tax)
     Non-zero exit on any failure; tells the operator to fall back to
     the two-instance topology if triggered.

  3. Stage-3 plan-doc stanza pointing at the skill + script, scoped
     specifically to the LoRA-self variant (external-API-opponent path
     is unaffected and needs no pre-flight).

Motivated by vllm-project/vllm issues 18372, 33791, 10898, 10062,
10617, 7977 surfaced during feasibility research. The pattern is
architecturally supported (NeMo-Aligner ships it for DPO/IPO; vLLM
docs document it) but under-exercised in prime-rl specifically.

Not a behavior change. No test additions -- the script itself IS the
test, gated behind live GPUs which aren't available from CI.

* chore: bump verifiers pin 42a965e -> 35826af (PR #4 squash)

Picks up the agent_bindings_fn feature from joanvelja/verifiers#4:
state-aware per-member (client, model) routing on MultiAgentEnv,
gpqa_debate external-opponent branch with learner_seat policy + pin,
shared-vLLM / LoRA-self topology support, runtime bindings validation.

Unblocks Task #11 (prime-rl learner_seat MemberRollout filter) to start
reading output.info["learner_seat"] set by the env-pack.

* feat(orchestrator): filter MemberRollouts by learner_seat

Stage 7 of the external-opponent debate pipeline. The verifiers-side
(PR #4) stamps info.learner_seat per row when opponent_model is set;
this side filters the fan-out so the frozen opponent's and judge's
trajectories never reach the trainer.

Changes:

1. fan_out_for_multi_agent gains `filter_by_learner_seat: bool = False`.
   When True, reads rollout.info['learner_seat'] and keeps only that
   member's unit. Missing info.learner_seat raises -- enabling the
   filter on a self-play env is a config mismatch, not a silent no-op.

2. MultiAgentConfig.filter_by_learner_seat: bool = False (new). Described
   in Pydantic Field so the TOML comment is auto-generated.

3. Orchestrator threads the knob into the fan-out call and the startup
   log line. No new validation gate -- the fan-out's runtime raise
   already fails loud on misconfigured envs.

4. Two new tests mirroring the existing drop_judge pair: filter=True
   keeps only the seated member; filter=True + missing info raises.

5. configs/gpqa_debate/rl_external_opponent.toml -- runnable config
   for the two-server topology (learner on orchestrator vLLM, opponent
   + judge on api.openai.com). Eval pins seat A for determinism across
   checkpoints. Comments at top point at the LoRA-self variant and the
   preflight smoke it requires.

Cannot run tests locally (prime-rl lockfile is Linux-only); CI will.
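
A minimal sketch of the seat filter inside the fan-out (surrounding fan-out code assumed):

    def filter_units_by_learner_seat(rollout, units: list) -> list:
        seat = rollout.info.get("learner_seat")
        if seat is None:
            # Enabling the filter on a self-play env is a config
            # mismatch, not a silent no-op -- fail loud.
            raise ValueError("filter_by_learner_seat=True but "
                             "info['learner_seat'] is missing")
        return [u for u in units if u["member_id"] == seat]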

* fix(orchestrator): address two Codex P1s on MA path

Two real bugs surfaced by Codex review of the MA fan-out path:

1. Custom advantage in MA mode silently corrupts gradients.
   The validation at line 225 correctly rejected advantage.type='default'
   for MA envs with the exact reasoning that compute_advantages' fixed-
   size reshape mixes seats/episodes under fan-out interleaving -- but
   allowed advantage.type='custom' through to the same broken code path.
   Same latent hazard for advantage=None. Tighten to "MA requires
   ema_per_member"; delete the dead else branch that would have called
   compute_advantages on the interleaved fan-out list.

2. Training-usage billing overstated by filtered-unit tokens.
   The MA fan-out refactor split "produce samples" from "filter samples":
   apply_filters marks unit['is_filtered'] without removing the unit,
   process_unit still returns samples for filtered units, and the
   accumulation loop tallied their tokens into num_prefill_tokens /
   num_decode_tokens before the train_examples.append gate. Those
   totals feed usage_reporter.report_training_usage(usage_type="training",
   tokens=...), so filtered rollouts were billing training that never
   happened. Gate token accumulation on is_filtered; leave
   rollout_total_samples alone since that's a "samples generated" count,
   which correctly includes filtered.

Behavior changes on intended configs: none -- no recipe in-tree uses
custom+MA, and the filtered-token undercount moves the billing number
toward the truth, not away.

---------

Co-authored-by: samsja <55492238+samsja@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Mika Senghaas <mail@mikasenghaas.de>
Co-authored-by: hallerite <git@hallerite.com>
Co-authored-by: JannikSt <JannikSt@users.noreply.github.com>
Co-authored-by: Matej Sirovatka <54212263+S1ro1@users.noreply.github.com>
Co-authored-by: rasdani <73563550+rasdani@users.noreply.github.com>
Co-authored-by: Jupiter <jupiterz@umich.edu>
Co-authored-by: Dominik <me@dominikscherm.de>
Co-authored-by: sami jaghouar <sami@primeintellect.ai>